[storage] Atomic alter table in the storage controller #31408
A hypothesis
At the storage collections level, alter table is roughly:
Observations:
So: suppose we |
Yeah, I think this can happen! At first I was skeptical: I thought re-acquiring the since handle would protect us, because I mistakenly assumed that someone, somewhere, would have to fail on a mismatched since when trying to downgrade. But we only compare against the epoch: when downgrading a critical since handle, we don't check that the actual since frontier is what we expect. So in the alter table situation it can (and does) happen that two valid critical since handles in the process can both downgrade the since (the one we freshly acquired and the one in the background worker), and both map to the same logical since handle (the same since handle ID).
Very nice find! 🙌
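To make the epoch-only comparison concrete, here is a minimal toy model in Rust. It is a sketch under stated assumptions, not the persist or storage controller API: `SharedSince`, `Handle`, `downgrade_epoch_only`, and `downgrade_checked` are all hypothetical names, and timestamps are plain integers. `downgrade_checked` models the behaviour the comment above says it mistakenly expected (a failure on a mismatched since); it is not presented as what this PR implements.

```rust
/// Toy stand-in for the one logical since handle shared by everyone
/// who holds its ID. Hypothetical; not the real persist types.
#[derive(Debug)]
struct SharedSince {
    epoch: u64,
    since: u64,
}

/// A holder of the handle. Two copies with the same epoch are both "valid".
#[derive(Debug, Clone, Copy)]
struct Handle {
    epoch: u64,
}

#[derive(Debug, PartialEq)]
enum DowngradeError {
    EpochMismatch { actual: u64 },
    SinceMismatch { actual: u64 },
}

impl SharedSince {
    fn new(epoch: u64, since: u64) -> Self {
        SharedSince { epoch, since }
    }

    /// Epoch-only check: the caller's view of the since is never consulted.
    /// The since only moves forward, so a stale request is silently absorbed.
    fn downgrade_epoch_only(&mut self, h: Handle, new_since: u64) -> Result<u64, DowngradeError> {
        if h.epoch != self.epoch {
            return Err(DowngradeError::EpochMismatch { actual: self.epoch });
        }
        self.since = self.since.max(new_since);
        Ok(self.since)
    }

    /// Hypothetical stricter variant: also fail if the since is not what
    /// the caller expected. This is the check that does not exist.
    fn downgrade_checked(
        &mut self,
        h: Handle,
        expected_since: u64,
        new_since: u64,
    ) -> Result<u64, DowngradeError> {
        if self.since != expected_since {
            return Err(DowngradeError::SinceMismatch { actual: self.since });
        }
        self.downgrade_epoch_only(h, new_since)
    }
}

fn main() {
    let mut shared = SharedSince::new(1, 5);
    let background_worker = Handle { epoch: 1 };
    let freshly_acquired = Handle { epoch: 1 };

    // The fresh handle observes since = 5 and plans around it.
    let observed = shared.since;

    // Meanwhile the background worker advances the since to 10.
    assert_eq!(shared.downgrade_epoch_only(background_worker, 10), Ok(10));

    // The fresh handle's downgrade still succeeds: nothing reports that
    // its observed since (5) is stale.
    assert_eq!(shared.downgrade_epoch_only(freshly_acquired, observed), Ok(10));

    // With a since comparison, the stale view would have been caught.
    assert_eq!(
        shared.downgrade_checked(freshly_acquired, observed, 7),
        Err(DowngradeError::SinceMismatch { actual: 10 })
    );
    println!("final since: {}", shared.since);
}
```

In this toy model both handles pass the epoch check, so neither side gets an error that would reveal the other's downgrade. That is the property the race depends on.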
Great! So far the only nightly errors are either also present on main or look like flakes (they don't alter any tables, so this code wouldn't run), but I'll try to confirm that before merging.
This avoids a possible race in the storage controller, where a downgrade of our read capability can race the selection of new capabilities for a new version of the table. (More detail in a comment downthread.)
We instead switch to:
...which together should avoid this particular race.
Motivation
https://github.com/MaterializeInc/database-issues/issues/8952
Tips for reviewer
This bug is quite tricky to reproduce, so I'm not 100% confident that this is the issue we're seeing in CI. Happy to take suggestions for improving the tests! But also please check my logic below: that it sounds like an accurate description of the current behaviour, and that the particular interleaving of events it imagines would explain the symptoms we see.
Checklist
$T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.